
feat(linux): add AMD MI300X ROCm bootstrap #8824

Draft
wenhug wants to merge 1 commit into main from wenhug/amd-mi300x-rocm-bootstrap



@wenhug wenhug commented Jul 2, 2026

Contributor

What this PR does / why we need it:

This is a draft PR for validating AMD MI300X ROCm bootstrap support in AgentBaker. It adds the first end-to-end wiring needed for an AKS node to identify itself as an AMD GPU node, install the ROCm host driver/runtime pieces during Linux CSE, and expose the node in a shape that the AMD device plugin can consume.

The CSE path added here is intentionally scoped to Ubuntu 24.04 amd64 and the MI300X SKUs we validated first:

  • Standard_ND96isr_MI300X_v5
  • Standard_ND96is_MI300X_v5
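
The scoping above could be expressed as a small guard. This is a minimal sketch, not the PR's actual gating code: the function name `is_amd_rocm_supported_target` is hypothetical, and the case-insensitive SKU comparison is an assumption.

```shell
# Hypothetical helper (not from the PR): returns 0 only for the OS release,
# architecture, and SKUs that the description lists as validated.
is_amd_rocm_supported_target() {
  local sku os_version arch
  # Assumption: SKU names are compared case-insensitively.
  sku="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  os_version="$2"
  arch="$3"

  [ "${os_version}" = "24.04" ] || return 1
  [ "${arch}" = "amd64" ] || return 1

  case "${sku}" in
    standard_nd96isr_mi300x_v5|standard_nd96is_mi300x_v5) return 0 ;;
    *) return 1 ;;
  esac
}
```

A caller would pass the VM size, Ubuntu release, and CPU architecture, and skip the ROCm path when the function returns non-zero.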

Main changes:

  • Adds AMD_GPU_NODE CSE environment plumbing from AgentBaker and aks-node-controller.
  • Routes AMD GPU nodes through a new ensureAmdGpuDrivers path before the existing NVIDIA driver path.
  • Installs a minimal ROCm host package set from the ROCm 7.2 Ubuntu 24.04 repos:
    • amdgpu-dkms
    • libdrm-amdgpu-dev
    • rocm-core
    • rocminfo
    • rocm-smi-lib
  • Configures the amdgpu kernel module to load on boot and removes stale amdgpu blacklist/install-false entries from /etc/modprobe.d.
  • Validates the node after install by checking DKMS state, modprobe amdgpu, /dev/kfd, /dev/dri/renderD*, rocminfo output for gfx942, and rocm-smi --showproductname output for AMD Instinct MI300X VF.
  • Removes the temporary ROCm apt sources and repo key after CSE installation so the node is not left pinned to repo.radeon.com.
  • Adds a VHD prebake proof-of-concept path behind the AMD_ROCM feature flag and extends Linux VHD content tests to verify the prebaked ROCm marker, packages, module config, binaries, and repo cleanup.
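
The module configuration step in the list above can be sketched roughly as follows. The function names are hypothetical, and the use of a `modules-load.d` drop-in for boot-time loading is an assumption; the PR does not say which mechanism it uses. Directories are passed as arguments so the sketch can be tried against a scratch directory.

```shell
# Hypothetical helper: delete "blacklist amdgpu" and "install amdgpu .../false"
# lines from every *.conf file in the given modprobe.d directory.
# Requires GNU sed (as shipped on Ubuntu) for in-place editing.
remove_amdgpu_blacklist() {
  local modprobe_dir="$1"
  local conf
  for conf in "${modprobe_dir}"/*.conf; do
    # Skip the unexpanded glob when the directory has no .conf files.
    [ -f "${conf}" ] || continue
    sed -i -E \
      -e '/^[[:space:]]*blacklist[[:space:]]+amdgpu[[:space:]]*$/d' \
      -e '/^[[:space:]]*install[[:space:]]+amdgpu[[:space:]]+[^[:space:]]*false[[:space:]]*$/d' \
      "${conf}"
  done
}

# Hypothetical helper: write a drop-in naming amdgpu so the module is loaded
# at boot (assumes the systemd modules-load.d convention).
enable_amdgpu_autoload() {
  local modules_load_dir="$1"
  mkdir -p "${modules_load_dir}"
  echo "amdgpu" > "${modules_load_dir}/amdgpu.conf"
}
```

On a real node the arguments would presumably be `/etc/modprobe.d` and `/etc/modules-load.d`. Lines that blacklist other modules are left untouched.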

Validation performed:

  • make generate
  • git diff --check
  • go test ./pkg/agent
  • go test ./parser (run from aks-node-controller)
  • Manual validation on a Standard_ND96isr_MI300X_v5 Ubuntu 24.04 VM in francecentral joined to an AKS cluster through AKSFlexNode.
  • Confirmed the node reached Ready with amd.com/gpu=8 capacity/allocatable after the AMD device plugin was installed.
  • Confirmed reboot recovery after fixing the amdgpu blacklist issue: /dev/kfd and /dev/dri/renderD* returned automatically, the flex node agent/nspawn services recovered, and the node became schedulable again.
  • Ran GPU smoke pods after reboot; the workload completed successfully and reported gfx942 via rocminfo and AMD Instinct MI300X VF via rocm-smi.
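
The device-node and tool-output checks mentioned in this PR could look roughly like the sketch below. It is an illustration under assumptions, not the PR's validation code: all function names are hypothetical, and the DKMS and `modprobe` checks are omitted. Tool output is captured before matching so that an early `grep -q` exit cannot surface as a broken-pipe failure.

```shell
# Hypothetical helper: succeeds when <dev_root>/kfd and at least one
# <dev_root>/dri/renderD* node exist.
has_amd_device_nodes() {
  local dev_root="${1:-/dev}"
  local node
  [ -e "${dev_root}/kfd" ] || return 1
  for node in "${dev_root}"/dri/renderD*; do
    [ -e "${node}" ] && return 0
  done
  return 1
}

# Hypothetical helper: succeeds when the text on stdin contains the marker.
output_mentions() {
  grep -qF -- "$1"
}

# Hypothetical wrapper tying the checks together on a real node.
validate_amd_gpu_node() {
  local out
  has_amd_device_nodes /dev || { echo "missing /dev/kfd or /dev/dri/renderD*" >&2; return 1; }

  out="$(rocminfo)" || return 1
  printf '%s\n' "${out}" | output_mentions "gfx942" || { echo "rocminfo did not report gfx942" >&2; return 1; }

  out="$(rocm-smi --showproductname)" || return 1
  printf '%s\n' "${out}" | output_mentions "AMD Instinct MI300X VF" || { echo "rocm-smi did not report MI300X VF" >&2; return 1; }
}
```

`validate_amd_gpu_node` only makes sense on a node with ROCm installed; the two smaller helpers can be exercised anywhere by pointing them at a scratch directory or sample text.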

Draft notes / open follow-ups:

  • This PR is for early CSE and VHD-path validation, not final production rollout.
  • The current CSE path downloads from repo.radeon.com; production use likely requires an AKS-approved package source, or a decision on a mirror or cache.
  • RP/nodepool product wiring is intentionally not included here.
  • The VHD prebake path is included as a proof of concept so we can compare CSE runtime install versus prebaked image behavior and provisioning latency.
  • Additional ShellSpec coverage may be worth adding once the final package source and rollout shape are agreed.

Which issue(s) this PR fixes:

N/A

Copilot AI review requested due to automatic review settings July 2, 2026 23:41

Copilot AI left a comment

Contributor


Pull request overview

Adds initial end-to-end wiring for AMD MI300X ROCm enablement on AKS Linux nodes by introducing an AMD_GPU_NODE CSE signal, an AMD driver install/validation path in Ubuntu CSE, and an optional VHD prebake proof-of-concept behind an AMD_ROCM feature flag.

Changes:

  • Plumbs AMD_GPU_NODE from AgentBaker and aks-node-controller into the Linux CSE environment and routes AMD GPU nodes through a new ensureAmdGpuDrivers path.
  • Implements Ubuntu 24.04-specific ROCm/AMDGPU installation + validation logic (DKMS/module/device nodes + rocminfo/rocm-smi checks) and cleans up temporary apt repo configuration afterward.
  • Adds a VHD prebake path gated by AMD_ROCM and extends the Linux VHD content tests to verify the prebaked marker/packages/module config and repo cleanup.
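
The routing order described in the first bullet (AMD path checked before the NVIDIA path) can be illustrated with a small dispatcher. This is a sketch only: `select_gpu_driver_path` is a hypothetical name, and the second argument stands in for whatever flag the existing NVIDIA path keys on, which this page does not show.

```shell
# Hypothetical dispatcher: prints which driver path would run.
#   $1: value of AMD_GPU_NODE ("true" or anything else)
#   $2: assumed flag for the existing NVIDIA path ("true" or anything else)
select_gpu_driver_path() {
  local amd_gpu_node="${1:-false}"
  local nvidia_gpu_node="${2:-false}"

  if [ "${amd_gpu_node}" = "true" ]; then
    # AMD is checked first, so it wins even if both flags are set.
    echo "amd"
  elif [ "${nvidia_gpu_node}" = "true" ]; then
    echo "nvidia"
  else
    echo "none"
  fi
}
```

In the real script the branches would call ensureAmdGpuDrivers or the NVIDIA installer rather than printing a label.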

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| vhdbuilder/scripts/linux/ubuntu/tool_installs_ubuntu.sh | Adds ROCm/AMDGPU VHD prebake functions (repo setup/cleanup, module autoload, validation, install). |
| vhdbuilder/packer/test/linux-vhd-content-test.sh | Adds a new testAmdRocmPrebake gated by AMD_ROCM to validate prebaked ROCm state and repo cleanup. |
| vhdbuilder/packer/install-dependencies.sh | Wires the AMD_ROCM feature flag to invoke installAmdRocmPrebake during VHD build and logs the marker. |
| pkg/agent/variables.go | Adds an amdGpuNode CSE variable derived from EnableAMDGPU. |
| pkg/agent/variables_test.go | Adds unit tests asserting amdGpuNode string output. |
| pkg/agent/baker_test.go | Adds a Linux CSE command test asserting AMD_GPU_NODE=true for an MI300X SKU config. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Adds ROCm/AMDGPU install + validation logic for Ubuntu 24.04 amd64, SKU gating, marker writing, and repo cleanup. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Routes AMD_GPU_NODE=true through ensureAmdGpuDrivers before the NVIDIA driver path. |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Adds AMD ROCm-related error codes for the CSE path. |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Emits AMD_GPU_NODE into the CSE environment. |
| aks-node-controller/parser/parser.go | Adds AMD_GPU_NODE to the generated CSE environment map. |
| aks-node-controller/parser/parser_test.go | Extends parser tests to validate AMD_GPU_NODE presence/values in env. |
| aks-node-controller/parser/helper.go | Adds getEnableAmdGpu helper for config parsing. |

Comment on lines +277 to +280
cat > /etc/apt/sources.list.d/rocm.list <<EOF
deb [arch=amd64 signed-by=${rocm_gpg_keyring_path}] https://repo.radeon.com/rocm/apt/${rocm_version} ${ubuntu_codename} main
deb [arch=amd64 signed-by=${rocm_gpg_keyring_path}] https://repo.radeon.com/graphics/${rocm_version}/ubuntu ${ubuntu_codename} main
EOF
Comment on lines +353 to +357
ensureAmdGpuDrivers() {
local rocm_version="${AMD_ROCM_VERSION:-7.2.4}"
local amdgpu_repo_version="${AMD_ROCM_AMDGPU_REPO_VERSION:-30.30.4}"
local amdgpu_dkms_version="${AMD_ROCM_AMDGPU_DKMS_VERSION:-1:6.16.13.30300400-2341068.24.04}"
local libdrm_amdgpu_dev_version="${AMD_ROCM_LIBDRM_AMDGPU_DEV_VERSION:-1:2.4.125.07020400-2341098.24.04}"
Comment on lines +243 to +246
cat > /etc/apt/sources.list.d/rocm.list <<EOF
deb [arch=amd64 signed-by=${rocm_gpg_keyring_path}] https://repo.radeon.com/rocm/apt/${rocm_version} ${ubuntu_codename} main
deb [arch=amd64,i386 signed-by=${rocm_gpg_keyring_path}] https://repo.radeon.com/graphics/${rocm_version}/ubuntu ${ubuntu_codename} main
EOF
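
As a counterpart to the source-list heredocs quoted above, the cleanup the PR describes (removing the temporary ROCm apt sources and repo key) might be sketched as below. Both function names are hypothetical, and the paths are taken as arguments because the value of `rocm_gpg_keyring_path` is not shown on this page.

```shell
# Hypothetical helper: remove the temporary ROCm apt source list and keyring.
# On a real node an `apt-get update` afterwards would refresh the package
# indexes; that step is left out of this sketch.
cleanup_rocm_apt_sources() {
  local sources_list="$1"
  local keyring_path="$2"
  rm -f "${sources_list}" "${keyring_path}"
}

# Hypothetical check: succeeds when no file under the given directory still
# references repo.radeon.com (a missing directory also counts as clean).
no_radeon_sources_left() {
  local sources_dir="$1"
  if grep -rqF "repo.radeon.com" "${sources_dir}" 2>/dev/null; then
    return 1
  fi
  return 0
}
```

A VHD content test could call the second function against the apt sources directory to assert that the node is not left pointing at repo.radeon.com.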